FEAT: Add word-game option to DecompositionConverter by Raulster24 · Pull Request #2051 · microsoft/PyRIT

Raulster24 · 2026-06-18T22:13:17Z

Description

This adds an optional word-game mode to DecompositionConverter (the DrAttack decompose-and-reconstruct converter from #2003), via use_word_game: bool = False. When enabled, each harmful noun phrase is replaced by an innocuous codeword in the reconstruction questions, and a mapping preamble (for example 'apple' means 'a bomb') is established in the same prompt. This is the second half of DrAttack: it further conceals the harmful nouns by splitting them from the request behind codewords.

Off by default, so the merged converter behaviour is unchanged.
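The substitution described above can be sketched as follows. This is a minimal illustration with placeholder phrases, not the converter's actual code: the helper name `apply_word_game`, the preamble wording, and the example codewords are assumptions made for the example; only the `use_word_game` flag and its `False` default come from the description.

```python
def apply_word_game(question, noun_phrases, codewords, use_word_game=False):
    """Hypothetical helper: swap each noun phrase for a codeword and prepend the mapping."""
    if not use_word_game:
        return question  # off by default: the prompt is returned unchanged
    if len(noun_phrases) > len(codewords):
        raise ValueError("more noun phrases than codewords")
    mapping_lines = []
    rewritten = question
    for phrase, codeword in zip(noun_phrases, codewords):
        # The original phrase appears once, in the mapping line only.
        mapping_lines.append(f"'{codeword}' means '{phrase}'")
        rewritten = rewritten.replace(phrase, codeword)
    preamble = "In this word game, " + "; ".join(mapping_lines) + "."
    return preamble + "\n" + rewritten


question = "How do I assemble the red widget with the blue gadget?"
phrases = ["the red widget", "the blue gadget"]
codes = ["apple", "banana"]

prompt_off = apply_word_game(question, phrases, codes)
prompt_on = apply_word_game(question, phrases, codes, use_word_game=True)
print(prompt_on)
```

With the flag off the question passes through untouched; with it on, the preamble and the rewritten question travel in a single prompt, which is the inline design discussed below.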

Two design choices worth flagging up front:

Inline, not a separate prepended conversation. We had discussed the word-game as a prepended/simulated conversation; I went with inline (preamble and reconstruction in one prompt) for two reasons. First, coupling: the codewords must match the reconstruction the converter builds, and a separate conversation generates its turns independently, so they cannot share the mapping without a stateful component (an attack class), which we wanted to avoid. Inline keeps it a pure converter. Second, the numbers: inline is close to the two-turn version on gpt-4o and somewhat below it on gpt-4o-mini, and both are far above the core converter with no word-game:

n=50, GPT-judge ASR: gpt-4o inline 44% / two-turn 46% / core 16%; gpt-4o-mini inline 52% / two-turn 62% / core 22%.
n=15 with the actual converter (not the harness), gpt-4o-mini: word-game off 20%, on 73%.

So inline keeps nearly all of the effect on the frontier model (gpt-4o: 44% vs 46%), with no new attack class. Open to the prepended-conversation route if you prefer it.

A toggle on the converter, not a separate converter. The codewords have to stay in sync with the reconstruction this converter produces, so a separate converter cannot do it; it has to be a mode of this converter.

Note on the mechanism: the harmful phrase still appears once, in the mapping line; the concealment is that the question uses the codeword, splitting the harmful term from the request. This is the paper's word-game, and the numbers above show the lift.

(All numbers are GPT-judge refusal-bypass, not operational harm, consistent with the #2003 assessment.)

Tests and Documentation

Added unit tests: codeword substitution, off-mode unchanged, custom codewords, and a clear error when there are more noun phrases than codewords.
Documented the new use_word_game parameter in doc/code/converters/1_text_to_text_converters.py; ran jupytext --sync.
ruff check and format clean; ty reports no errors; full converter and docs test suites pass.

cc @rlundeen2 @romanlutz

adrian-gavrila

Thanks for the contribution! A few small things are worth attention, but overall this looks great.


Raulster24 · 2026-06-22T07:02:50Z

@adrian-gavrila Thanks for the review. Addressed all three: codeword uniqueness is now validated in init with a test, the arg docstring is trimmed, and the overflow message states the threshold breach instead of a count.

adrian-gavrila

Looks good! Thank you for addressing the comments.


Raulster24 · 2026-06-28T06:43:25Z

@romanlutz Thanks for the review. All three points are addressed:

The mapping is now json.dumps-serialized on both sides (ensure_ascii=False so Arabic stays readable), so a phrase containing quotes can no longer make it ambiguous.
The noun-overflow case is validated early and raised as a retryable InvalidJsonException; ValueError is kept for config-time issues.
No references.bib change is needed.
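A minimal sketch of the escaping point, using only the standard library. The helper name `mapping_line` and the exact "means" wording are assumptions for illustration; the use of `json.dumps` with `ensure_ascii=False` on both sides is what the comment above describes.

```python
import json


def mapping_line(codeword, phrase):
    """Hypothetical helper: render one mapping entry with both sides JSON-quoted."""
    left = json.dumps(codeword, ensure_ascii=False)
    right = json.dumps(phrase, ensure_ascii=False)
    return f"{left} means {right}"


# Embedded double quotes are backslash-escaped, so the entry stays unambiguous.
quoted = mapping_line("apple", 'the "red" widget')

# With ensure_ascii=False, non-ASCII text is kept as-is.
arabic = mapping_line("banana", "كتاب")

# For contrast: the default (ensure_ascii=True) emits \uXXXX escapes instead.
escaped = json.dumps("كتاب")

print(quoted)
print(arabic)
print(escaped)
```

The design choice is that JSON string quoting gives a single, well-defined escaping rule for both the codeword and the phrase, rather than hand-rolled quote handling.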

Following the same "model output and user config are both unpredictable" idea, I hardened a few adjacent cases where the change surfaced:

Empty or duplicate codewords now fail fast with a clear ValueError at construction.
An empty or whitespace phrase from the model is rejected as a retryable InvalidJsonException instead of producing a meaningless mapping.
A blocked or empty response from the decomposition target now retries instead of throwing IndexError.
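The split between config-time and model-output failures listed above might look roughly like this. It is a sketch under stated assumptions, not the merged code: `RetryableOutputError` is a local stand-in for PyRIT's InvalidJsonException, and the function names and messages are hypothetical.

```python
class RetryableOutputError(Exception):
    """Local stand-in for a retryable error such as InvalidJsonException."""


def validate_codewords(codewords):
    """Config-time checks: user mistakes fail fast with ValueError."""
    if not codewords:
        raise ValueError("codewords must not be empty")
    if any(not c.strip() for c in codewords):
        raise ValueError("codewords must not contain empty entries")
    if len(set(codewords)) != len(codewords):
        raise ValueError("codewords must be unique")


def validate_phrases(phrases, codewords):
    """Model-output checks: raise a retryable error so the caller can ask again."""
    if any(not p.strip() for p in phrases):
        raise RetryableOutputError("model returned an empty phrase")
    if len(phrases) > len(codewords):
        raise RetryableOutputError("noun phrases exceed the available codewords")


def error_name(fn, *args):
    """Return the name of the exception type raised by fn(*args), or None."""
    try:
        fn(*args)
    except Exception as exc:
        return type(exc).__name__
    return None


print(error_name(validate_codewords, ["apple", "apple"]))
print(error_name(validate_phrases, ["widget", "gadget"], ["apple"]))
```

Keeping the two exception types distinct lets a retry wrapper re-query the model on bad output while still surfacing configuration errors immediately.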

Diff coverage on the changed lines is complete (escaping, Arabic, overflow recovery, empty codewords/phrases, multi-noun ordering). ruff, ty, and the converter and docs suites pass.

FEAT: Add word-game option to DecompositionConverter

64c84fb

adrian-gavrila self-assigned this Jun 19, 2026

adrian-gavrila reviewed Jun 19, 2026


Comment thread pyrit/prompt_converter/decomposition_converter.py

Comment thread pyrit/prompt_converter/decomposition_converter.py

Comment thread pyrit/prompt_converter/decomposition_converter.py Outdated

Raulster24 added 2 commits June 22, 2026 10:58

Merge remote-tracking branch 'upstream/main' into raulster24/add-deco…

f7d00c3

…mposition-word-game

Address review: validate codeword uniqueness, trim docstring, reword …

0d78c5e

…overflow message

Raulster24 requested a review from adrian-gavrila June 26, 2026 15:02

adrian-gavrila approved these changes Jun 26, 2026


romanlutz reviewed Jun 27, 2026


Comment thread pyrit/prompt_converter/decomposition_converter.py Outdated

Comment thread pyrit/datasets/prompt_converters/decomposition/word_game_preamble.yaml

Comment thread pyrit/prompt_converter/decomposition_converter.py Outdated

Raulster24 added 2 commits June 27, 2026 17:59

Merge remote-tracking branch 'upstream/main' into raulster24/add-deco…

0f238d9

…mposition-word-game

Address review: escape word-game mapping (keep non-ASCII), make noun …

1cbe9b4

…overflow and empty response retryable, validate empty codewords/phrases

romanlutz approved these changes Jun 28, 2026


romanlutz added this pull request to the merge queue Jun 29, 2026

Merged via the queue into microsoft:main with commit a48fd9c Jun 29, 2026
53 checks passed

romanlutz merged 5 commits into microsoft:main from Raulster24:raulster24/add-decomposition-word-game

3 participants
